
MSSC 6250 Statistical Machine Learning
When \(p \gg n\), it is often the case that many of the features in the model are not associated with the response.
Model Interpretability: By removing irrelevant features \(X_j\)s, i.e., setting the corresponding \(\beta_j\)s to zero, we can obtain a model that is more easily interpreted. (Feature/Variable selection)
Least squares is unlikely to yield any coefficient estimates that are exactly zero.
olsrr::ols_step_all_possible()
olsrr::ols_step_best_subset()
olsrr::ols_step_forward_p()
olsrr::ols_step_backward_p()
olsrr::ols_step_both_p()
glmnet::glmnet()
pls::pcr()
pls::plsr()
Two “conflicting” goals in model building:
as many features as possible for better predictive performance on new data (smaller bias).
as few regressors as possible, because as the number of regressors increases, the variance of the predictions grows and the model becomes harder to interpret and maintain.
A compromise between the two hopefully leads to the “best” regression equation.
What does best mean?
There is no unique definition of “best”, and different methods specify different subsets of the candidate regressors as best.
The full (largest) model has \(M\) candidate regressors.
There are \(M \choose d\) possible subset models of \(d\) coefficients.
There are totally \(2^M\) possible subset models.
An evaluation metric should consider Goodness of Fit and Model Complexity:
Goodness of Fit: The more regressors, the better
Complexity Penalty: The fewer regressors, the better
For a model with \(d\) predictors,
\[\begin{align} C_p &= \frac{SS_{res}(d)}{\hat{\sigma}^2} - n + 2d \\ &= d + \frac{(s^2 - \hat{\sigma}^2)(n-d)}{\hat{\sigma}^2} \end{align}\]
\(\hat{\sigma}^2\) is the variance estimate from the full model, i.e., \(\hat{\sigma}^2 = MS_{res}(M)\).
\(s^2\) is the variance estimate from the model with \(d\) coefficients, i.e., \(s^2 = MS_{res}(d)\).
Favors the candidate model with the smallest \(C_p\).
For unbiased models, i.e., those with \(E[\hat{y}_i] = E[y_i]\), we expect \(C_p \approx d\).
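As a sanity check, \(C_p\) can be computed directly from two `lm()` fits. A minimal base-R sketch with made-up simulated data (the variable names and data-generating setup are mine, for illustration only); note that the full model satisfies \(C_p = M\) exactly by construction, since \(\hat{\sigma}^2 = MS_{res}(M)\):

```r
set.seed(6250)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)        # x3 is irrelevant by design

full <- lm(y ~ x1 + x2 + x3)           # M = 4 coefficients (incl. intercept)
sub  <- lm(y ~ x1 + x2)                # d = 3 coefficients

sigma2_hat <- summary(full)$sigma^2    # MS_res of the full model

Cp <- function(fit) {
  d <- length(coef(fit))
  sum(resid(fit)^2) / sigma2_hat - n + 2 * d
}
Cp(full)   # equals 4 exactly, by construction
Cp(sub)    # near 3, since dropping x3 leaves the model unbiased
```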

For a model with \(d\) predictors,
Akaike information criterion (AIC) is \[\text{AIC} = n \ln \left( \frac{SS_{res}(d)}{n} \right) + 2d\]
Bayesian information criterion (BIC) is \[\text{BIC} = n \ln \left( \frac{SS_{res}(d)}{n} \right) + d \ln (n)\]
BIC penalizes additional variables more heavily as the sample size increases, since \(\ln(n) > 2\) once \(n \geq 8\).
BIC tends to choose models with fewer features.
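Both criteria match what base R's `extractAIC()` reports for `lm` fits (`k = 2` for AIC, `k = log(n)` for BIC), so the formulas can be verified directly; the toy data below are made up:

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

rss <- sum(resid(fit)^2)
d   <- length(coef(fit))               # number of estimated coefficients

aic <- n * log(rss / n) + 2 * d
bic <- n * log(rss / n) + d * log(n)

aic - extractAIC(fit)[2]               # 0: same formula as base R's AIC
bic - extractAIC(fit, k = log(n))[2]   # 0: same formula as base R's BIC
```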
PREdiction Sum of Squares (PRESS)
\(\text{PRESS}_d = \sum_{i=1}^n[y_i - \hat{y}_{(i)}]^2 = \sum_{i=1}^n\left( \frac{e_i}{1-h_{ii}}\right)^2\) where \(e_i = y_i - \hat{y}_i\) and \(h_{ii}\) is the \(i\)th diagonal element of the hat matrix \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\).
PRESS is in fact the LOOCV estimate of test MSE!
\(R_{pred, d}^2 = 1 - \frac{PRESS_d}{SS_T}\)
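The shortcut formula means PRESS never requires refitting \(n\) models: one fit plus `hatvalues()` suffices. A base-R check on made-up data, compared against brute-force leave-one-out:

```r
set.seed(2)
n <- 40
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

fit <- lm(y ~ x, data = dat)
press <- sum((resid(fit) / (1 - hatvalues(fit)))^2)

# brute-force LOOCV: refit n times, each leaving one observation out
loo_err <- sapply(seq_len(n), function(i) {
  f <- lm(y ~ x, data = dat[-i, ])
  dat$y[i] - predict(f, newdata = dat[i, , drop = FALSE])
})
all.equal(press, sum(loo_err^2))   # TRUE: identical up to rounding

R2_pred <- 1 - press / sum((dat$y - mean(dat$y))^2)
```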
Assume the intercept is in all models.
If there are \(M\) possible regressors, we investigate all \(2^M - 1\) possible regression equations.
Use the selection criteria to determine some candidate models and complete regression analysis on them.
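The `olsrr` helpers above automate this enumeration; a minimal base-R sketch of the same idea (function and column names are my own) makes the \(2^M - 1\) count concrete:

```r
all_subsets <- function(X, y) {
  M <- ncol(X); n <- length(y)
  sigma2_hat <- summary(lm(y ~ ., data = data.frame(X, y = y)))$sigma^2
  rows <- list()
  for (d in seq_len(M)) {
    for (idx in combn(M, d, simplify = FALSE)) {
      f   <- lm(y ~ ., data = data.frame(X[, idx, drop = FALSE], y = y))
      rss <- sum(resid(f)^2)
      rows[[length(rows) + 1]] <- data.frame(
        predictors = paste(colnames(X)[idx], collapse = " "),
        adj_r2     = summary(f)$adj.r.squared,
        cp         = rss / sigma2_hat - n + 2 * (d + 1))  # d slopes + intercept
    }
  }
  do.call(rbind, rows)
}

set.seed(3)
X <- matrix(rnorm(60 * 4), 60, 4, dimnames = list(NULL, paste0("x", 1:4)))
y <- 2 * X[, 1] - X[, 2] + rnorm(60)
tab <- all_subsets(X, y)
nrow(tab)   # 2^4 - 1 = 15 candidate models
```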
| Index | N | Predictors | R-Square | Adj. R-Square | Mallow's Cp |
|---|---|---|---|---|---|
| 3 | 1 | x3 | 0.972 | 0.970 | 20.38 |
| 1 | 1 | x1 | 0.971 | 0.970 | 21.20 |
| 2 | 1 | x2 | 0.893 | 0.886 | 114.97 |
| 4 | 1 | x4 | 0.884 | 0.877 | 125.87 |
| 5 | 1 | x5 | 0.335 | 0.290 | 785.26 |
| 10 | 2 | x2 x3 | 0.987 | 0.985 | 4.94 |
| 6 | 2 | x1 x2 | 0.986 | 0.984 | 5.66 |
| 14 | 2 | x3 x5 | 0.985 | 0.983 | 7.29 |
| 9 | 2 | x1 x5 | 0.984 | 0.982 | 8.16 |
| 13 | 2 | x3 x4 | 0.975 | 0.972 | 18.57 |
| 8 | 2 | x1 x4 | 0.974 | 0.970 | 20.04 |
| 7 | 2 | x1 x3 | 0.973 | 0.969 | 21.99 |
| 11 | 2 | x2 x4 | 0.931 | 0.921 | 72.29 |
| 12 | 2 | x2 x5 | 0.924 | 0.913 | 80.30 |
| 15 | 2 | x4 x5 | 0.910 | 0.898 | 96.50 |
| 23 | 3 | x2 x3 x5 | 0.990 | 0.988 | 2.92 |
| 18 | 3 | x1 x2 x5 | 0.989 | 0.987 | 3.71 |
| 16 | 3 | x1 x2 x3 | 0.987 | 0.984 | 6.21 |
| 22 | 3 | x2 x3 x4 | 0.987 | 0.984 | 6.90 |
| 17 | 3 | x1 x2 x4 | 0.986 | 0.983 | 7.66 |
| 20 | 3 | x1 x3 x5 | 0.985 | 0.982 | 8.97 |
| 25 | 3 | x3 x4 x5 | 0.985 | 0.982 | 9.00 |
| 21 | 3 | x1 x4 x5 | 0.985 | 0.981 | 9.41 |
| 19 | 3 | x1 x3 x4 | 0.978 | 0.974 | 16.80 |
| 24 | 3 | x2 x4 x5 | 0.952 | 0.941 | 48.28 |
| 30 | 4 | x2 x3 x4 x5 | 0.991 | 0.988 | 4.03 |
| 28 | 4 | x1 x2 x4 x5 | 0.991 | 0.987 | 4.26 |
| 27 | 4 | x1 x2 x3 x5 | 0.991 | 0.987 | 4.35 |
| 26 | 4 | x1 x2 x3 x4 | 0.988 | 0.984 | 7.54 |
| 29 | 4 | x1 x3 x4 x5 | 0.985 | 0.980 | 10.92 |
| 31 | 5 | x1 x2 x3 x4 x5 | 0.991 | 0.987 | 6.00 |
All possible selection may be computationally infeasible.
Other subset selection methods (forward, backward, stepwise) do not explore all possible subset models, so there is no guarantee of finding the globally best one.
Ridge regression shrinks coefficients, but still keeps all predictors in the model.
Lasso regularizes coefficients so that some are shrunk exactly to zero, thereby performing feature selection.
Like ridge regression, for a given \(\lambda\), Lasso only fits a single model.
Unlike ridge regression, which adds an \(\ell_2\) penalty, Lasso adds an \(\ell_1\) penalty on the parameters:
\[\begin{align} \widehat{\boldsymbol \beta}^\text{l} =& \, \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n \lambda \lVert\boldsymbol \beta\rVert_1\\ =& \, \, \mathop{\mathrm{arg\,min}}_{\boldsymbol \beta} \frac{1}{n} \sum_{i=1}^n (y_i - x_i' \boldsymbol \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|, \end{align}\]
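In practice `glmnet::glmnet()` solves this problem by coordinate descent. A bare-bones base-R sketch of that idea (my own toy implementation, assuming each column of \(\mathbf{X}\) is standardized so \(\frac{1}{n}\sum_i x_{ij}^2 = 1\)) cycles through the coordinates, each time applying the univariate soft-thresholding update derived later in this section:

```r
soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

lasso_cd <- function(X, y, lambda, n_iter = 200) {
  p <- ncol(X)
  beta <- rep(0, p)
  r <- y                                 # residual when beta = 0
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      r <- r + X[, j] * beta[j]          # partial residual, excluding x_j
      b_j <- mean(X[, j] * r)            # univariate OLS coefficient
      beta[j] <- soft(b_j, lambda / 2)   # soft-threshold it
      r <- r - X[, j] * beta[j]
    }
  }
  beta
}

set.seed(4)
n <- 200; p <- 5
X <- scale(matrix(rnorm(n * p), n, p)) * sqrt(n / (n - 1))  # (1/n) sum x_ij^2 = 1
y <- 2 * X[, 1] + rnorm(n)
lasso_cd(X, y, lambda = 0.5)   # only a few coefficients survive
lasso_cd(X, y, lambda = 10)    # large penalty: all coefficients are 0
```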
ElemStatLearn::prostate Data

| lcavol | lweight | age | lbph | svi | lcp | gleason | pgg45 | lpsa | train |
|---|---|---|---|---|---|---|---|---|---|
| -0.580 | 2.77 | 50 | -1.386 | 0 | -1.39 | 6 | 0 | -0.431 | TRUE |
| -0.994 | 3.32 | 58 | -1.386 | 0 | -1.39 | 6 | 0 | -0.163 | TRUE |
| -0.511 | 2.69 | 74 | -1.386 | 0 | -1.39 | 7 | 20 | -0.163 | TRUE |
| -1.204 | 3.28 | 58 | -1.386 | 0 | -1.39 | 6 | 0 | -0.163 | TRUE |
| 0.751 | 3.43 | 62 | -1.386 | 0 | -1.39 | 6 | 0 | 0.372 | TRUE |
| -1.050 | 3.23 | 50 | -1.386 | 0 | -1.39 | 6 | 0 | 0.765 | TRUE |
| 0.737 | 3.47 | 64 | 0.615 | 0 | -1.39 | 6 | 0 | 0.765 | FALSE |
| 0.693 | 3.54 | 58 | 1.537 | 0 | -1.39 | 6 | 0 | 0.854 | TRUE |
cv.glmnet(alpha = 1): the model chosen at lambda.min contains more nonzero parameters than the one chosen at lambda.1se.
Larger penalty \(\lambda\) forces more parameters to be zero, and the model is more “sparse”.
Lasso solution does not have an analytic or closed form in general.
Consider the univariate regression model
\[\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \sum_{i=1}^n (y_i - x_i \beta)^2 + \lambda |\beta|\]
With some derivation that uses the OLS solution \(b\) of the unpenalized loss, we have
\[\begin{align} &\frac{1}{n} \sum_{i=1}^n (y_i - x_i \beta)^2 \\ =& \frac{1}{n} \sum_{i=1}^n (y_i - x_i b + x_i b - x_i \beta)^2 \\ =& \frac{1}{n} \sum_{i=1}^n \Big[ \underbrace{(y_i - x_i b)^2}_{\text{I}} + \underbrace{2(y_i - x_i b)(x_i b - x_i \beta)}_{\text{II}} + \underbrace{(x_i b - x_i \beta)^2}_{\text{III}} \Big] \end{align}\]
\[\begin{align} & \sum_{i=1}^n 2(y_i - x_i b)(x_i b - x_i \beta) = (b - \beta) {\color{OrangeRed}{\sum_{i=1}^n 2(y_i - x_i b)x_i}} = (b - \beta) {\color{OrangeRed}{0}} = 0 \end{align}\]
Our original problem reduces to just the third term and the penalty
\[\begin{align} &\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \sum_{i=1}^n (x_ib - x_i \beta)^2 + \lambda |\beta| \\ =&\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad \frac{1}{n} \left[ \sum_{i=1}^n x_i^2 \right] (b - \beta)^2 + \lambda |\beta| \end{align} \] Without loss of generality, assume that \(x\) is standardized with mean 0 and variance \(\frac{1}{n}\sum_{i=1}^n x_i^2 = 1\).
This leads to a general problem of
\[\underset{\beta}{\mathop{\mathrm{arg\,min}}} \quad (\beta - b)^2 + \lambda |\beta|,\] For \(\beta > 0\),
\[\begin{align} 0 =& \frac{\partial}{\partial \beta} \,\, \left[(\beta - b)^2 + \lambda |\beta| \right] = 2 (\beta - b) + \lambda \\ \Longrightarrow \quad \beta =&\, b - \lambda/2 \end{align}\]
\[\begin{align} \widehat{\beta}^\text{l} &= \begin{cases} b - \lambda/2 & \text{if} \quad b > \lambda/2 \\ 0 & \text{if} \quad |b| \le \lambda/2 \\ b + \lambda/2 & \text{if} \quad b < -\lambda/2 \\ \end{cases} \end{align}\]
Lasso provides a soft-thresholding solution.
When \(\lambda\) is large enough, \(\widehat{\beta}^\text{l}\) will be shrunk to zero.
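The closed-form solution above is easy to code and to check against a one-dimensional numerical optimizer. A small base-R sketch (the function name is mine):

```r
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)

soft_threshold(1.0, 1)    # 0.5 = b - lambda/2
soft_threshold(0.3, 1)    # 0: |b| <= lambda/2, shrunk exactly to zero
soft_threshold(-1.0, 1)   # -0.5 = b + lambda/2

# agrees with numerically minimizing (beta - b)^2 + lambda * |beta|
b <- 0.8; lambda <- 1
opt <- optimize(function(beta) (beta - b)^2 + lambda * abs(beta),
                interval = c(-2, 2))
c(numeric_min = opt$minimum, closed_form = soft_threshold(b, lambda))
```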
For example, when \(b = 1\) the objective function is \((\beta - 1)^2 + \lambda|\beta|\). Once the penalty \(\lambda\) is larger than \(2 = 2|b|\), the minimizer stays at 0.
The proportion of times a variable has a non-zero parameter estimate.
\(\mathbf{y}= \mathbf{X}\boldsymbol \beta + \boldsymbol \epsilon = \sum_{j = 1}^p X_j \times 0.4^{\sqrt{j}} + \boldsymbol \epsilon\)
\(p = 20\), \(\epsilon \sim N(0, 1)\), replicated 100 times.
For every value of \(\lambda\), there is some \(s\) such that the two optimization problems are equivalent, giving the same coefficient estimates.
The \(\ell_1\) and \(\ell_2\) penalties form a constraint region within which the \(\beta_j\)s must lie, i.e., a budget for how large they can be.
Larger \(s\) (smaller \(\lambda\)) means a larger region in which \(\beta_j\) can move freely.
Lasso \[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n\lambda\lVert\boldsymbol \beta\rVert_1 \end{align}\]
\[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2\\ \text{s.t.} \,\, & \sum_{j=1}^p|\beta_j| \leq s \end{align}\]
Ridge \[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2 + n\lambda\lVert\boldsymbol \beta\rVert_2^2 \end{align}\]
\[\begin{align} \min_{\boldsymbol \beta} \,\,& \lVert \mathbf{y}- \mathbf{X}\boldsymbol \beta\rVert_2^2\\ \text{s.t.} \,\, & \sum_{j=1}^p \beta_j^2 \leq s \end{align}\]
What do the constraints look like geometrically?
When \(p = 2\),

Lasso Soft-thresholding
\[\begin{align} \widehat{\beta}^\text{l} &= \begin{cases} b - \lambda/2 & \text{if} \quad b > \lambda/2 \\ 0 & \text{if} \quad |b| < \lambda/2 \\ b + \lambda/2 & \text{if} \quad b < -\lambda/2 \\ \end{cases} \end{align}\]

Ridge Proportional shrinkage
\[\begin{align} \widehat{\beta}^\text{r} = \dfrac{b}{1+\lambda}\end{align}\]
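The contrast is easy to see numerically: ridge rescales the OLS estimate \(b\), while Lasso translates and truncates it, so only Lasso produces exact zeros. A base-R comparison (helper names are mine):

```r
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda / 2, 0)
ridge_shrink <- function(b, lambda) b / (1 + lambda)

b <- 0.4
lasso_shrink(b, lambda = 1)    # 0: hits zero once lambda/2 >= |b|
ridge_shrink(b, lambda = 1)    # 0.2: shrunk, but not zero
ridge_shrink(b, lambda = 100)  # tiny, yet still nonzero
```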

Perform well (lower test MSE) when
Lasso
A relatively small number of \(\beta_j\)s are substantially large, and the remaining \(\beta_k\)s are small or equal to zero.
Reduce more bias

Ridge
The response is a function of many predictors, all with coefficients of roughly equal size.
Reduce more variance

Lasso
Ridge
Let’s wait until we discuss Bayesian Regression!
Warning
Even though Lasso performs feature selection, do not add predictors that are known to be unassociated with the response in any way.
Curse of dimensionality. The test MSE tends to increase as the dimensionality \(p\) increases, unless the additional features are truly associated with the response.
Do not conclude that the predictors with non-zero coefficients selected by Lasso and other selection methods predict the response more effectively than other predictors not included in the model.
Elastic net combines the two penalties:
\[\lambda \left[ (1 - \alpha) \lVert \boldsymbol \beta\rVert_2^2 + \alpha \lVert\boldsymbol \beta\lVert_1 \right]\]

In glmnet::glmnet(), the mixing parameter alpha ranges from 0 (ridge) to 1 (Lasso).